Skip to content

test(engine): add regression test for alarm-during-sleep-transition race#4755

Draft
NathanFlurry wants to merge 1 commit intosleep-cleanup/09-rust-sleep-testsfrom
sleep-cleanup/10-alarm-overdue-during-sleep-test
Draft

test(engine): add regression test for alarm-during-sleep-transition race#4755
NathanFlurry wants to merge 1 commit intosleep-cleanup/09-rust-sleep-testsfrom
sleep-cleanup/10-alarm-overdue-during-sleep-test

Conversation

@NathanFlurry
Copy link
Copy Markdown
Member

Description

Please include a summary of the changes and the related issue. Please also include relevant motivation and context.

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

How Has This Been Tested?

Please describe the tests that you ran to verify your changes.

Checklist:

  • My code follows the style guidelines of this project
  • I have performed a self-review of my code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

Copy link
Copy Markdown
Member Author

NathanFlurry commented Apr 24, 2026

Warning

This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.
Learn more

This stack of pull requests is managed by Graphite. Learn more about stacking.

@claude
Copy link
Copy Markdown

claude Bot commented Apr 24, 2026

PR Review: Regression Test for Alarm-During-Sleep-Transition Race

Overview

Adds a single regression test alarm_overdue_during_sleep_transition_fires_via_reallocation to guard against the alarm-during-sleep-transition race where a near-overdue alarm is silently dropped when Decision::Sleep clears state.alarm_ts without checking for overdue alarms.


Issues

1. #[ignore] makes this a non-functional regression guard

The test never runs in CI. If the Decision::Sleep overdue-alarm reallocation path is later reverted, no CI gate will catch it. The other ignored tests in this file (envoy_at_max_capacity, kv_empty_value) are ignored for infrastructure-capacity reasons; this one is different because the fix being guarded is in production code.

Suggestion: Either remove #[ignore] if the test is reliable enough (consider bumping the alarm offset to 500ms+ for more margin), or explicitly track it as a deferred follow-up.

2. Potential lifecycle event subscription race

The test subscribes to lifecycle events after ready_rx.await but without first confirming the actor entered sleep. With a 100ms alarm offset, the alarm can fire and the actor can restart at generation 1 before subscribe_lifecycle_events() is called, causing the test to hang waiting for a missed event.

The existing alarm_fires_at_correct_time test uses a 5000ms alarm and explicitly waits for wait_for_actor_sleep before subscribing (with a comment explaining why). The new test skips both guards.

Suggestion: Call runner.subscribe_lifecycle_events() before ready_rx.await, or wait for sleep confirmation before subscribing.

3. Delta documentation in comments

The doc comment references the specific before/after state and file path: "Before the fix in actor2/runtime.rs, this window cleared state.alarm_ts..."

CLAUDE.md discourages delta-documenting comments. If actor2/runtime.rs is renamed or the fix moves, the comment becomes misleading.

Suggestion: Describe the invariant being tested rather than the historical before/after. E.g.: "Verifies that an alarm that becomes overdue during the sleep-shutdown transition is not silently dropped: the engine must detect and handle it via reallocation."

4. No sleep-entry assertion

The test verifies the actor wakes at generation 1 but never confirms it entered sleep first. If a regression caused the actor to stay awake and handle the alarm via the normal (non-overdue) path, the test would still pass at generation 1 without exercising the race-specific code path.

Suggestion: Add a wait_for_actor_sleep call (short timeout) before waiting for the wake, to confirm the sleep-transition path was actually exercised.


Positive Aspects

  • Reuses AlarmAndSleepOnceActor without adding new actor structs.
  • The scenario description is thorough and clearly explains the race.
  • Correct test file placement (tests/ directory, not inline #[cfg(test)]).
  • Consistent Arc<Mutex<Option<oneshot::Sender<()>>>> ready-signal pattern.

Summary

The race scenario is valid and worth guarding. Main concerns: (1) #[ignore] means no CI regression protection; (2) the 100ms alarm window creates a subscription race that could cause spurious hangs; (3) delta-style comments will rot as the codebase evolves; (4) without a sleep-entry check the test does not prove the race path was actually exercised.

@NathanFlurry NathanFlurry force-pushed the sleep-cleanup/10-alarm-overdue-during-sleep-test branch from 50500b5 to d151463 Compare April 24, 2026 13:16
@NathanFlurry NathanFlurry force-pushed the sleep-cleanup/09-rust-sleep-tests branch from ed93ef7 to 8d93174 Compare April 24, 2026 13:16
@github-actions
Copy link
Copy Markdown
Contributor

Preview packages published to npm

Install with:

npm install rivetkit@pr-4755

All packages published as 0.0.0-pr.4755.867c305 with tag pr-4755.

Engine binary is shipped via @rivetkit/engine-cli on linux-x64-musl, linux-arm64-musl, darwin-x64, and darwin-arm64. Windows users should use the release installer or set RIVET_ENGINE_BINARY.

Docker images:

docker pull rivetdev/engine:slim-867c305
docker pull rivetdev/engine:full-867c305
Individual packages
npm install rivetkit@pr-4755
npm install @rivetkit/react@pr-4755
npm install @rivetkit/rivetkit-napi@pr-4755
npm install @rivetkit/workflow-engine@pr-4755

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant